# Load packages and import data
library(readr)
library(ggplot2)
library(dplyr)
lifts <- read_csv("https://mac-stat.github.io/data/powerlifting.csv")Simple linear regression: Visualization and Introduction (Notes)
STAT 155
Notes
Learning goals
By the end of this lesson, you should be able to:
- Visualize and describe the relationship between two quantitative variables using a scatterplot
- Write R code to create a scatterplot and compute the linear correlation between two quantitative variables
- Describe/identify weak / strong, and positive / negative correlation from a point cloud
- Build intuition for fitting lines to quantify the relationship between two quantitative variables
Readings and videos
Choose either the reading or the videos to go through after class.
- Reading: Sections 2.8, 3.1-3.3, 3.6 in the STAT 155 Notes
- Videos:
- Summarizing the Relationships between Two Quantitative Variables (Time: 12:12)
- Introduction to Linear Models (Time: 10:57)
- Method of Least Squares (Time: 5:10)
- Interpretation of Intercept and Slope (Time: 11:09)
- R Code for Fitting a Linear Model (Time: 11:07)
File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.
Exercises
Context: Today we’ll explore data from a weighlifting competition. The data originally came from Kaggle and OpenPowerlifting. Our main goal will be to explore the relationship between strength-to-weight ratio (SWR) and body weight. Read in the data below.
Exercise 1: Get to know the data
Create a new code chunk by clicking the green “C” button with a green + sign in the top right of the menu bar. In this code chunk, use an appropriate function to look at the first few rows of the data.
Create a new code chunk, and use an appropriate function to learn how much data we have (in terms of cases and variables).
What does a case represent?
Navigate to the FAQ page and read the response to the “How does this site work? Do you just download results from the federations?” question. What do you learn about data quality and completeness from this response?
Exercise 2: Mutating our data
Strength-to-weight ratio (SWR) is defined as TotalKg/BodyweightKg. We can use the mutate() function from the dplyr package to create a new variable in our dataframe for SWR using the following code:
# The %>% is called a "pipe" and feeds what comes before it
# into what comes after (lifts data is "fed into" the mutate() function).
# When creating a new variable, we often reassign the data frame to itself,
# which updates the existing columns in lifts with the additional "new" column(s)
# in lifts!
lifts <- lifts %>%
mutate(NEW_VARIABLE_NAME = Age/BestSquatKg)
## Error in `mutate()`:
## ℹ In argument: `NEW_VARIABLE_NAME = Age/BestSquatKg`.
## Caused by error:
## ! object 'BestSquatKg' not foundAdapt the example above to create a new variable called SWR, where SWR is defined as TotalKg/BodyweightKg.
Exercise 3: Get to know the outcome/response variable
Let’s get acquainted with the SWR variable.
- Construct an appropriate plot to visualize the distribution of this variable, and compute appropriate numerical summaries.
- Write a good paragraph interpreting the plot and numerical summaries.
Exercise 4: Data visualization - two quantitative variables
We’d like to visualize the relationship between body weight and the strength-to-weight ratio. A scatterplot (or informally, a “point cloud”) allows us to do this! The code below creates a scatterplot of body weight vs. SWR using ggplot().
# scatterplot
# The alpha = 0.5 in geom_point() adds transparency to the points
# to make them easier to see. You can make this smaller for more transparency
lifts %>%
ggplot(aes(x = BodyweightKg, y = SWR)) +
geom_point(alpha = 0.5)
## Error in `geom_point()`:
## ! Problem while computing aesthetics.
## ℹ Error occurred in the 1st layer.
## Caused by error:
## ! object 'SWR' not found- This is your first bivariate data visualization (visualization for two variables)! What differences do you notice in the code structure when creating a bivariate visualization, compared to univariate visualizations we’ve worked with before?
- What similarities do you notice in the code structure?
- Does there appear to be some sort of pattern in the structure of the point cloud? Describe it, in no more than three sentences! Comment on the direction of the relationship between the two variables (positive? negative?) and the spread of the points (are they dispersed? close together?).
Exercise 5: Scatterplots - patterns in point clouds
Sometimes, it can be easier to see a pattern in a point cloud by adding a smoothing line to our scatterplots. The code below adapts the code in Exercise 4 to do this:
# scatterplot with smoothing line
lifts %>%
ggplot(aes(x = BodyweightKg, y = SWR)) +
geom_point(alpha = 0.5) +
geom_smooth()
## Error in `geom_point()`:
## ! Problem while computing aesthetics.
## ℹ Error occurred in the 1st layer.
## Caused by error:
## ! object 'SWR' not found- Look back at your answer to Exercise 4 (c). Does the smoothing line assist you in seeing a pattern, or change your answer at all? Why or why not?
- Based on the scatterplot with the smoothing line added above, does there appear to be a linear relationship between body weight and SWR (i.e. would a straight line do a decent job at summarizing the relationship between these two variables)? Why or why not?
Exercise 6: Correlation
We can quantify the linear relationship between two quantitative variables using a numerical summary known as correlation (sometimes known as a “correlation coefficient” or “Pearson’s correlation”). Correlation can range from -1 to 1, where a correlation of 0 indicates that there is no linear relationship between the two quantitative variables.
Below is an example of a “Math Box”. You’ll see these occasionally throughout the activities. You are not required to memorize, nor will you be assessed on, anything in the math boxes. If you plan on continuing with Statistics courses at Macalester (or are interested in the math behind everything!), these math boxes are for you!
The Pearson correlation coefficient, \(r_{x, y}\), of \(x\) and \(y\) is the (almost) average of products of the z-scores of variables \(x\) and \(y\):
\[ r_{x, y} = \frac{\sum z_x z_y}{n - 1} \]
In general, we will want to be able to describe (qualitatively) two aspects of correlation:
- Strength
- Is the correlation between x and y strong, or weak, i.e. how closely do the points fit around a line? This has to do with how dispersed our point clouds are.
- Direction
- Is the correlation between x and y positive or negative, i.e. does y go “up” when x goes “up” (positive), or does y go “down” when x goes “up” (negative)?
Stronger correlations will be further from 0 (closer to -1 or 1), and positive and negative correlations will have the appropriate respective sign (above or below zero).
- Rather than a smooth trend line, we can force the line we add to our scatterplots to be linear using
geom_smooth(method = 'lm'), as below:
# scatterplot with linear trend line
lifts %>%
ggplot(aes(x = BodyweightKg, y = SWR)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm")
## Error in `geom_point()`:
## ! Problem while computing aesthetics.
## ℹ Error occurred in the 1st layer.
## Caused by error:
## ! object 'SWR' not found- Based on the above scatterplot, how would you describe the correlation between body weight and SWR, in terms of strength and direction?
- Make a guess as to what numerical value the correlation between body weight and SWR will have, based on your response to part (b).
Exercise 7: Computing correlation in R
We can compute the correlation between body weight and SWR using summarize and cor functions:
# correlation
# Note: the order in which you put your two quantitative variables into the cor
# function doesn't matter! Try switching them around to confirm this for yourself
# Because of the missing data, we need to include the use = "complete.obs" - otherwise the correlation would be computed as NA
lifts %>%
summarize(cor(SWR, BodyweightKg, use = "complete.obs"))
## Error in `summarize()`:
## ℹ In argument: `cor(SWR, BodyweightKg, use = "complete.obs")`.
## Caused by error:
## ! object 'SWR' not foundIs the computed correlation close to what you guessed in Exercise 6 part (c)?
Exercise 8: Limitations of correlation
We previously noted that correlation was a numerical summary of the linear relationship between two variables. We’ll now go through some examples of relationships between quantitative variables to demonstrate why it is incredibly important to visualize our data in addition to just computing numerical summaries!
For this exercise, we’ll be working with the anscombe dataset, which is built in to R. To load this dataset into our environment, we run the following code:
# load anscombe data
data("anscombe")The anscombe dataset contains four different pairs of quantitative variables:
x1,y1x2,y2x3,y3x4,y4
Adapt the code we used in Exercise 7 to compute the correlation between each of these four pairs of variables, below:
# correlation between x1, y1
# correlation between x2, y2
# correlation between x3, y3
# correlation between x4, y4What do you notice about each of these correlations (if the answer to this isn’t obvious, double-check your code)?
Describe these correlations in terms of strength and direction, using only the numerical summary to assist you in your description.
Draw an example on the white board or at your tables of what you think the point clouds for these pairs of variables might look like. There are only 11 observations, so you can draw all 11 points if you’d like!
Adapt the code for scatterplots given previously in this activity to make four distinct scatterplots for each pair of quantitative variables in the
anscombedataset. You do not need to add a smooth trend line or a linear trend line to these plots.
# scatterplot: x1, y1
# scatterplot: x2, y2
# scatterplot: x3, y3
# scatterplot: x4, y4- Based on the correlations you calculated and scatterplots you made, what is the message of this last exercise as it relates to the limits of correlation?
Reflection
Much of statistics is about making (hopefully) reasonable assumptions in attempt to summarize observed relationships in data. Today we started considering assumptions of linear relationships between quantitative variables.
Review the learning objectives at the top of this file and today’s activity. How do you imagine assumptions of linearity might be useful in terms of quantifying relationships between quantitative variables? How do you imagine these assumptions could sometimes fall short, or even be unethical in certain cases?
Response: Put your response here.
Discovery - Intuition for Least Squares
Exercise 9: Lines of best fit
In this activity, we’ve learned how to fit straight lines to data, to help us visualize the relationship between two quantitative variables. So far, ggplot has chosen the line for us. How does it know which line is “best”, and what does “best” even mean?
For this exercise, we’ll consider the relationship between x1 and y1 in the anscombe dataset. Run the following code, which creates a scatterplot with a fitted line to our data using the function geom_abline:
# scatterplot with a fitted line, whose slope is 0.4 and intercept is 3
anscombe %>%
ggplot(aes(x = x1, y = y1)) +
geom_point() +
geom_abline(slope = 0.4, intercept = 3, col = "blue", size = 1)Describe the line that you see. Do you think the line is “good”? What are you using to define “good”?
Some things to think about:
- How many points are above the line?
- How many points are below the line?
- Are the distances of the points above and below the line roughly similar, or is there meaningful difference?
Now we’ll add another line to our plot. Which line do you think is better suited for this data? Why? Be specific!
# scatterplot with a fitted line, whose slope is 0.4 and intercept is 3
anscombe %>%
ggplot(aes(x = x1, y = y1)) +
geom_point() +
geom_abline(slope = 0.4, intercept = 3, col = "blue", size = 1) +
geom_abline(slope = 0.5, intercept = 4, col = "orange", size = 1)It’s usually quite simple to note when a line is bad, but more difficult to quantify when a line is a good fit for our data. Consider the following line:
# scatterplot with a fitted line, whose slope is 0.4 and intercept is 3
anscombe %>%
ggplot(aes(x = x1, y = y1)) +
geom_point() +
geom_abline(slope = -0.5, intercept = 10, col = "red", size = 1) In the next activity, we’ll formalize the principle of least squares, which will give us one particular definition of a line of best fit that is commonly used in statistics! We’ll take advantage of the vertical distances between each point and the fitted line (residuals), which will help us define (mathematically) a line that best fits our data:
library(broom)
anscombe %>%
lm(y1 ~ x1, data = .) %>%
augment() %>%
ggplot(aes(x = x1, y = y1)) +
geom_smooth(method = "lm", se = FALSE) +
geom_segment(aes(xend = x1, yend = .fitted), col = "red") +
geom_point()Additional Practice
Exercise 10: Correlation and extreme values
In this exercise, we’ll explore how correlation changes with the addition of extreme values, or observations. We’ll begin by generating a toy dataset called dat with two quantitative variables, x and y. Run the code below to create the dataset.
while not required, recall that you can look up function documentation in R using the ? in front of a function name to figure out what that function is doing!
# create a toy dataset
set.seed(1234)
x <- rnorm(100, mean = 5, sd = 2)
y <- -3 * x + rnorm(100, sd = 4)
dat <- data.frame(x = x, y = y)- Make a scatterplot of
xvs.y.
# scatterplot- Based on your scatterplot, describe the correlation between
xandyin terms of strength and direction. - Guess the correlation (the numerical value) between
xandy. - Compute the correlation between
xandy. Was your guess from part (c) close?
# correlation- Suppose we observe an additional observation with
x = 15andy = -45. We can create a new data frame,dat_new1, that contains this observation in addition to the original ones as follows:
# creating dat_new1
x1 <- c(x, 15)
y1 <- c(y, -45)
dat_new1 <- data.frame(x = x1, y = y1)- Make a scatterplot of
xvs.yfor this new data frame, and compute the correlation betweenxandy. Did your correlation change very much with the addition of this observation? Hypothesize why or why not.
# scatterplot
# correlation- Suppose instead of our additional observation having values
x = 15andy = -45, we instead observex = 15andy = -15. We can create a new data frame,dat_new2, that contains this observation in addition to the original ones as follows:
# creating dat_new1
x2 <- c(x, 15)
y2 <- c(y, 45)
dat_new2 <- data.frame(x = x2, y = y2)- Make a scatterplot of
xvs.yfor this new data frame, and compute the correlation betweenxandy. Did your correlation change very much with the addition of this observation? Hypothesize why or why not.
# scatterplot
# correlation- What do you think the takeaway message is of this exercise?
- Challenge Add linear trend lines to your scatterplots from parts (f) and (h). Does this give you any additional insight into why the correlations may have changed in different ways with the addition of a new observation?
Done!
- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer – check that your work translated correctly; and (3) Outside RStudio, navigate to your ‘Activities’ subfolder within your ‘STAT155’ folder and locate the HTML file – you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the ‘Stop’ button in the ‘Background Jobs’ pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework!